16 research outputs found
Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets
This work presents a text-to-audio-retrieval system based on pre-trained text
and spectrogram transformers. Our method projects recordings and textual
descriptions into a shared audio-caption space in which related examples from
different modalities are close. Through a systematic analysis, we examine how
each component of the system influences retrieval performance. As a result, we
identify two key components that play a crucial role in driving performance:
the self-attention-based audio encoder for audio embedding and the utilization
of additional human-generated and synthetic data sets during pre-training. We
further experimented with augmenting ClothoV2 captions with available keywords
to increase their variety; however, this only led to marginal improvements. Our
system ranked first in the 2023 DCASE Challenge, and it outperforms the
current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10. Comment: submitted to DCASE Workshop 2023
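The shared audio-caption space described above is typically learned by projecting both modalities into a common space and training with a symmetric contrastive objective. Below is a minimal PyTorch sketch of that idea; the projection sizes, temperature, and loss form are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Maps audio and caption embeddings into one shared retrieval space
    (a minimal sketch; dimensions are illustrative, not the paper's)."""
    def __init__(self, audio_dim=768, text_dim=768, shared_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, audio_emb, text_emb):
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        return a, t

def symmetric_contrastive_loss(a, t, temperature=0.05):
    """Pulls matching audio-caption pairs together, pushes mismatches apart."""
    logits = a @ t.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Hypothetical pre-computed embeddings from the audio and text encoders
audio_emb, text_emb = torch.randn(8, 768), torch.randn(8, 768)
model = SharedSpaceProjector()
loss = symmetric_contrastive_loss(*model(audio_emb, text_emb))
```

At retrieval time, the same normalized projections are reused: captions are ranked by cosine similarity to a query recording (or vice versa), so related examples from different modalities end up close in the shared space.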
Low-Complexity Audio Embedding Extractors
Solving tasks such as speaker recognition, music classification, or semantic
audio event tagging with deep learning models typically requires
computationally demanding networks. General-purpose audio embeddings (GPAEs)
are dense representations of audio signals that allow lightweight, shallow
classifiers to tackle various audio tasks. The idea is that a single complex
feature extractor would extract dense GPAEs, while shallow MLPs can produce
task-specific predictions. If the extracted dense representations are general
enough to allow the simple downstream classifiers to generalize to a variety of
tasks in the audio domain, a single costly forward pass suffices to solve
multiple tasks in parallel. In this work, we try to reduce the cost of GPAE
extractors to make them suitable for resource-constrained devices. We use
efficient MobileNets trained on AudioSet using Knowledge Distillation from a
Transformer ensemble as efficient GPAE extractors. We explore how to obtain
high-quality GPAEs from the model, study how model complexity relates to the
quality of extracted GPAEs, and conclude that low-complexity models can
generate competitive GPAEs, paving the way for analyzing audio streams on edge
devices w.r.t. multiple audio classification and recognition tasks. Comment: In Proceedings of the 31st European Signal Processing Conference,
EUSIPCO 2023. Source Code available at:
https://github.com/fschmid56/EfficientAT_HEA
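The division of labour described above, one costly extractor and several cheap task heads, could be sketched as follows; the embedding dimension and the particular task heads are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class ShallowHead(nn.Module):
    """Lightweight MLP classifier operating on pre-computed audio embeddings."""
    def __init__(self, emb_dim=960, hidden=512, n_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, emb):
        return self.mlp(emb)

# One forward pass of the (frozen) efficient extractor yields embeddings that
# several task-specific heads reuse in parallel.
embeddings = torch.randn(4, 960)                 # hypothetical GPAE batch
heads = {"speaker": ShallowHead(n_classes=50),
         "music_genre": ShallowHead(n_classes=8),
         "event_tagging": ShallowHead(n_classes=527)}
predictions = {task: head(embeddings) for task, head in heads.items()}
```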
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation
Audio Spectrogram Transformer models rule the field of Audio Tagging,
outrunning previously dominating Convolutional Neural Networks (CNNs). Their
superiority is based on the ability to scale up and exploit large-scale
datasets such as AudioSet. However, Transformers are demanding in terms of
model size and computational requirements compared to CNNs. We propose a
training procedure for efficient CNNs based on offline Knowledge Distillation
(KD) from high-performing yet complex transformers. The proposed training
schema and the efficient CNN design based on MobileNetV3 result in models
outperforming previous solutions in terms of parameter and computational
efficiency and prediction performance. We provide models of different
complexity levels, scaling from low-complexity models up to a new
state-of-the-art performance of .483 mAP on AudioSet. Source Code available at:
https://github.com/fschmid56/EfficientAT Comment: Submitted to ICASSP 2023. Source Code available at:
https://github.com/fschmid56/EfficientAT
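Offline distillation of this kind trains the efficient CNN against a weighted mix of ground-truth labels and stored teacher predictions. The sketch below is hedged: the BCE formulation for multi-label AudioSet tagging and the loss weight are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def offline_kd_loss(student_logits, teacher_probs, labels, kd_weight=0.9):
    """Offline KD: teacher (ensemble) predictions are pre-computed once and
    stored, so the expensive Transformers never run during student training."""
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    kd_loss = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return (1 - kd_weight) * label_loss + kd_weight * kd_loss

# Hypothetical shapes for the 527 AudioSet classes
student_logits = torch.randn(16, 527)            # efficient MobileNetV3 output
teacher_probs = torch.rand(16, 527)              # stored ensemble probabilities
labels = torch.randint(0, 2, (16, 527)).float()
loss = offline_kd_loss(student_logits, teacher_probs, labels)
```

Because the teacher predictions are computed once and cached, the student's training cost is essentially that of the small CNN alone.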
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models
The introduction of large-scale audio datasets, such as AudioSet, paved the
way for Transformers to conquer the audio domain and replace CNNs as the
state-of-the-art neural network architecture for many tasks. Audio Spectrogram
Transformers are excellent at exploiting large datasets, creating powerful
pre-trained models that surpass CNNs when fine-tuned on downstream tasks.
However, current popular Audio Spectrogram Transformers are demanding in terms
of computational complexity compared to CNNs. Recently, we have shown that, by
employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch
up with and even outperform Transformers on large datasets. In this work, we
extend this line of research and increase the capacity of efficient CNNs by
introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic
convolutions and attention mechanisms. We show that these dynamic CNNs
outperform traditional efficient CNNs, in terms of the performance-complexity
trade-off and parameter efficiency, at the task of audio tagging on the
large-scale AudioSet. Our experiments further indicate that the introduced
dynamic CNNs achieve better performance on downstream tasks and scale up well,
attaining the performance of Transformers and even outperforming them on AudioSet and
several downstream tasks. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing. Source Code available at:
https://github.com/fschmid56/EfficientAT
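One of the dynamic components mentioned above, dynamic convolution, weights several parallel kernels with an input-dependent attention vector. The following is a minimal sketch under assumed sizes, not the paper's exact block design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregates K parallel kernels with input-dependent attention (sketch)."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_kernels))

    def forward(self, x):
        b, c, h, w = x.shape
        att = F.softmax(self.attn(x), dim=1)             # (B, K) kernel weights
        # Per-sample kernel = attention-weighted sum of the K kernels
        w_dyn = torch.einsum("bk,koihw->boihw", att, self.weight)
        out_ch, in_ch, kh, kw = w_dyn.shape[1:]
        # Grouped-conv trick: fold the batch into the channel dimension
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       w_dyn.reshape(b * out_ch, in_ch, kh, kw),
                       padding=kh // 2, groups=b)
        return out.reshape(b, out_ch, h, w)

x = torch.randn(2, 32, 64, 64)       # hypothetical spectrogram feature map
y = DynamicConv2d(32, 64)(x)         # -> (2, 64, 64, 64)
```

The attention branch adds only a pooled linear layer per block, so the extra capacity comes at a small computational overhead compared to a static convolution of the same size.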
Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers
The success of supervised deep learning methods is largely due to their
ability to learn relevant features from raw data. Deep Neural Networks (DNNs)
trained on large-scale datasets are capable of capturing a diverse set of
features, and learning a representation that generalizes to unseen tasks
and datasets from the same domain. Hence, these models can be used as
powerful feature extractors, in combination with shallower models as
classifiers, for smaller tasks and datasets where the amount of training data
is insufficient for learning an end-to-end model from scratch. During the past
years, Convolutional Neural Networks (CNNs) have largely been the method of
choice for audio processing. However, recently attention-based transformer
models have demonstrated great potential in supervised settings, outperforming
CNNs. In this work, we investigate the use of audio transformers trained on
large-scale datasets to learn general-purpose representations. We study how the
different setups in these audio transformers affect the quality of their
embeddings. We experiment with the models' time resolution, extracted embedding
level, and receptive fields in order to see how they affect performance on a
variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge
evaluation setup. Our results show that representations extracted by audio
transformers outperform CNN representations. Furthermore, we show that
transformers trained on AudioSet can be extremely effective representation
extractors for a wide range of downstream tasks. Comment: will appear in HEAR: Holistic Evaluation of Audio Representations,
Proceedings of Machine Learning Research PMLR 166. Source code:
https://github.com/kkoutini/passt_hear2
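In the HEAR setup, a pre-trained transformer has to expose both clip-level and timestamp-level embeddings for the shallow downstream classifiers. Below is a generic sketch of that pooling step; the shapes and the mean-pooling choice are assumptions for illustration, not the exact PaSST interface (see the linked repository for the models' own extraction helpers).

```python
import torch

def pool_token_embeddings(tokens, mode="scene"):
    """Reduce an audio transformer's token sequence to downstream embeddings.

    tokens: (batch, time_steps, emb_dim) time-ordered token embeddings
            (hypothetical shape for illustration)."""
    if mode == "scene":        # one embedding per clip, e.g. for classification
        return tokens.mean(dim=1)
    if mode == "timestamp":    # one embedding per step, e.g. for event detection
        return tokens
    raise ValueError(f"unknown mode: {mode}")

tokens = torch.randn(4, 250, 768)                       # assumed: 250 steps, 768 dims
clip_emb = pool_token_embeddings(tokens, "scene")       # (4, 768)
frame_emb = pool_token_embeddings(tokens, "timestamp")  # (4, 250, 768)
```

Choices such as the model's time resolution and the layer from which tokens are taken change what ends up in `tokens`, which is exactly the kind of setup variation the study above evaluates across the HEAR tasks.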
Rethinking data augmentation for adversarial robustness
Recent work has proposed novel data augmentation methods to improve the adversarial robustness of deep neural networks. In this paper, we re-evaluate such methods through the lens of different metrics that characterize the augmented manifold, finding contradictory evidence. Our extensive empirical analysis involving 5 data augmentation methods, all tested with an increasing probability of augmentation, shows that: (i) novel data augmentation methods proposed to improve adversarial robustness only improve it when combined with classical augmentations (like image flipping and rotation), and even worsen adversarial robustness if used in isolation; and (ii) adversarial robustness is significantly affected by the augmentation probability, conversely to what is claimed in recent work. We conclude by discussing how to rethink the development and evaluation of novel data augmentation methods for adversarial robustness. Our open-source code is available at https://github.com/eghbalz/rethink_da_for_a